dnadna.datasets
Utilities for loading data from data sets, including support for different data set formats. Data sets are collections of SNP files for multiple scenarios, possibly with multiple replicates per scenario:
- The NpzSNPSource class reads a data set of multiple parameter scenarios with (possibly) multiple replicates per scenario, stored in NPZ files in a particular filesystem layout, known as the DNADNA Format. This is the default data set format understood by DNADNA.
- The DictSNPSource class reads a JSON-based data set format which is less efficient both in terms of storage compactness and parsing/serializing, but which allows plain-text storage of SNP data. Currently this is used primarily in testing.
The DNATrainingDataset and its simpler base class DNADataset are implementations of a PyTorch Dataset used for loading SNP data (in the form of SNPSamples), along with their associated scenario parameters, for both training sets and validation sets during model training. This works independently of what the dataset format is (the dataset format is implemented as an SNPSource, such as the two listed above, which is an abstract interface for arbitrary dataset formats; see the SNPSource base class documented on this page).
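Classes
DNADataset | Simplified base class for DNADNA datasets which simply maps an integer index to an SNPSample instance from the simulation dataset.
DNATrainingDataset |
DatasetTransformationMixIn | Partially implemented Dataset which accepts parameters for transforming the SNP data returned from the data source.
DictSNPSource | SNP source that reads from a JSON-like data structure consisting of a dict with (simulation, replicate) pairs for keys, and SNPSamples in JSON-compatible format for values.
FileListSNPSource | SNP source that returns scenarios from a fixed list of arbitrary files.
NpzSNPSource | SNP source that reads simulation data as SNPSamples stored on disk in DNADNA’s native “dnadna” format.
SNPSource | A “SNPSource” is a class for loading SNPSample objects from some data source.
Exceptions
MissingSNPSample | Exception raised when a specified sample is not found in an SNP source.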
- class dnadna.datasets.DNADataset(config={}, validate=True, source=None, scenario_params=None, scenario_set=None, cached_set=None)[source]
Bases: ConfigMixIn, Dataset
Simplified base class for DNADNA datasets which simply maps an integer index to an SNPSample instance from the simulation dataset.
This has two modes of operation. In the first, a scenario_params table is given as a pandas.DataFrame in the format described for the DNADNA Format. In this case, all the scenarios and replicates described in that table are returned (where they exist), and for each item in the dataset a (scenario_idx, replicate_idx, snp_sample, scenario_params) tuple is returned.
In the second mode of operation, scenario_params is not given, and the data sources are simply looped over directly. In this case a 4-tuple of (scenario_idx, replicate_idx, snp_sample, None) is returned for each item.
The DNATrainingDataset is the more complete implementation which can perform additional transformations on the data when used in model training, and which keeps separate training and validation sets.
Given a scenario_set=<scenario_idx> argument, only the data in a single scenario are returned; this may also be a list/set of scenario indices to consider.
- property cached_set
Indices whose samples should be cached in memory.
- config_schema = 'dataset'
The schema against which this class should validate its Config by default.
May be either the name of one of the built-in schemas (see Config.schemas) or a full schema object.
- classmethod from_config_file(filename, *args, validate=True, source=None, scenario_params=None, scenario_set=None, **kwargs)[source]
Load the Config from a file.
Additional kwargs are passed to from_file.
The additional keyword arguments are passed to the dict serializer, and the config is validated against the dataset schema.
- get(index, ignore_missing_replicates=None)[source]
Same as DNATrainingDataset.__getitem__ but adds additional optional arguments.
- Parameters:
index – Index of the sample to get from the dataset.
- Keyword Arguments:
ignore_missing_replicates (bool, optional) – Whether or not to raise an error if the sample file is missing or can’t be loaded for another reason. By default this defers to the ignore_missing_replicates option in the dataset configuration, but this argument allows overriding the config file.
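The two modes of operation described for DNADataset can be illustrated without DNADNA itself. The following is a minimal sketch, not the library's implementation: the TinyDataset class, the plain dicts standing in for the SNP source and the scenario_params table, and the n_replicates key are all assumptions made for illustration.

```python
class TinyDataset:
    """Hypothetical stand-in for the indexing behaviour described above."""

    def __init__(self, source, scenario_params=None, scenario_set=None):
        # source: dict mapping (scenario_idx, replicate_idx) -> sample
        # scenario_params: optional dict mapping scenario_idx -> parameters,
        #   with an assumed 'n_replicates' entry per scenario
        self.source = source
        self.scenario_params = scenario_params
        if scenario_params is not None:
            # Mode 1: enumerate the replicates listed in the table, keeping
            # only those that actually exist in the source.
            keys = [(s, r)
                    for s, params in sorted(scenario_params.items())
                    for r in range(params['n_replicates'])
                    if (s, r) in source]
        else:
            # Mode 2: loop directly over whatever the source contains.
            keys = sorted(source)
        if scenario_set is not None:
            # Accept a single scenario index or a list/set of indices.
            if isinstance(scenario_set, int):
                scenario_set = {scenario_set}
            wanted = set(scenario_set)
            keys = [key for key in keys if key[0] in wanted]
        self._keys = keys

    def __len__(self):
        return len(self._keys)

    def __getitem__(self, index):
        scenario_idx, replicate_idx = self._keys[index]
        params = None
        if self.scenario_params is not None:
            params = self.scenario_params[scenario_idx]
        sample = self.source[scenario_idx, replicate_idx]
        return scenario_idx, replicate_idx, sample, params


source = {(0, 0): 'a', (0, 1): 'b', (1, 0): 'c'}  # replicate (1, 1) is missing
table = {0: {'n_replicates': 2}, 1: {'n_replicates': 2}}

with_params = TinyDataset(source, scenario_params=table)
without_params = TinyDataset(source)
only_scenario_1 = TinyDataset(source, scenario_set=1)
```

In this sketch the missing replicate (1, 1) is silently skipped in the first mode, and the fourth element of each item is None whenever no parameter table is supplied.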
- class dnadna.datasets.DNATrainingDataset(config={}, validate=True, source=None, scenario_params=None, transforms=None, learned_params=None)[source]
Bases: DatasetTransformationMixIn
- config_schema = 'training'
The schema against which this class should validate its Config by default.
May be either the name of one of the built-in schemas (see Config.schemas) or a full schema object.
- classmethod from_config_file(filename, validate=True, source=None, scenario_params=None, transforms=None, learned_params=None, **kwargs)[source]
Load the Config from a file.
Additional kwargs are passed to from_file.
The additional keyword arguments are passed to the dict serializer, and the config is validated against the training schema.
- class dnadna.datasets.DatasetTransformationMixIn(config, transforms=None, param_set=None, **kwargs)[source]
Bases: DNADataset
Partially implemented Dataset which accepts parameters for transforming the SNP data returned from the data source.
- Parameters:
transforms (list) – list giving transform names or transform descriptions (a transform name plus its parameters) as specified in the dataset_transforms property in the training config file. See also schema-training. May also contain instances of Transform.
param_set (ParamSet) – ParamSet object representing all the details of the parameters to learn in training, including the values of those parameters for the training and validation sets (the pre-processed scenario params); information about the parameters can be used by some transforms.
Additional positional and keyword arguments are passed to super().__init__() so that this can be used as a mix-in with arbitrary DNADataset subclasses.
- get(index, ignore_missing_replicates=None)[source]
Same as DNATrainingDataset.__getitem__ but adds additional optional arguments.
- Parameters:
index – Index of the sample to get from the dataset.
- Keyword Arguments:
ignore_missing_replicates (bool, optional) – Whether or not to raise an error if the sample file is missing or can’t be loaded for another reason. By default this defers to the ignore_missing_replicates option in the dataset configuration, but this argument allows overriding the config file.
- get_split_set(split_type)[source]
Get the set of indices for the specified split type.
- Parameters:
split_type (str or list of str) – The split type(s) to retrieve.
- Returns:
frozenset – A frozenset of indices for the specified split type(s).
- property test_set
Set of indices to use for testing.
- property training_set
Set of indices to use for training.
- property transforms
The composed set of transforms to apply to the dataset.
Either dnadna.transforms.Compose or a dict mapping dataset splits (“training”, “validation”, “test”) to their corresponding Compose of transforms.
- property validation_set
Set of indices to use for validation.
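The relationship between split sets and per-split transforms can be sketched in plain Python. This is an illustration under stated assumptions, not DNADNA code: the Compose class here is a hypothetical minimal re-implementation, the splits dict and the crop and double transforms are invented, and taking the union of several split types is an assumed reading of get_split_set.

```python
class Compose:
    """Hypothetical minimal composition: apply callables left to right."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, value):
        for transform in self.transforms:
            value = transform(value)
        return value


# Assumed example splits: dataset indices assigned to each split.
splits = {
    'training': frozenset({0, 1, 2, 3}),
    'validation': frozenset({4, 5}),
    'test': frozenset({6}),
}


def get_split_set(split_type):
    """Return the indices for one split name or a list of split names."""
    if isinstance(split_type, str):
        split_type = [split_type]
    return frozenset().union(*(splits[name] for name in split_type))


def crop(values):
    return values[:2]


def double(values):
    return [2 * value for value in values]


# A dict mapping each split to its own composed transforms.
transforms = {
    'training': Compose([crop, double]),
    'validation': Compose([crop]),
    'test': Compose([crop]),
}


def apply_transforms(index, values):
    """Apply the transforms of whichever split contains the index."""
    for name, indices in splits.items():
        if index in indices:
            return transforms[name](values)
    raise KeyError(index)


training_result = apply_transforms(1, [1, 2, 3])
validation_result = apply_transforms(4, [1, 2, 3])
```

Keeping a separate Compose per split makes it possible to apply, for example, augmentation-style transforms to training items only.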
- class dnadna.datasets.DictSNPSource(scenarios, position_format=None, filename=None, lazy=True)[source]
Bases: SNPSource
SNP source that reads from a JSON-like data structure consisting of a dict with (simulation, replicate) pairs for keys, and SNPSamples in JSON-compatible format for values (see to_dict).
Currently used just by the test suite, but may be useful in other contexts as well (e.g. serialization of simulations).
- Parameters:
scenarios (dict) – dict with (simulation, replicate) tuple keys, and values in the format output by to_dict; the values may also be SNPSample instances (useful for testing).
- Keyword Arguments:
position_format (dict, optional) – Position format dict corresponding to the pos_format argument to SNPSample (currently all samples in the dataset are assumed to have the same position formats).
filename (str, optional) – If the scenarios dict was read from a file (e.g. a JSON or YAML file) this can be set to the filename; this is used just as a convenience when reporting errors.
lazy (bool, optional) – By default data is lazy-loaded, so that it is not converted from the dict format until needed. Use lazy=False to ensure that the data is immediately converted.
Examples
>>> from dnadna.datasets import DictSNPSource
>>> from dnadna.snp_sample import SNPSample
>>> sample = SNPSample([[0, 1], [1, 0]], [0.1, 0.2])
>>> source = DictSNPSource({(0, 0): sample.to_dict()},
...                        filename='scenario_0_0.json')
>>> source.scenarios
{(0, 0): {'SNP': ['01', '10'], 'POS': [0.1, 0.2]}}
>>> (0, 0) in source
True
>>> source[0, 0]
SNPSample(
    snp=tensor([[0, 1],
                [1, 0]], dtype=torch.uint8),
    pos=tensor([0.1000, 0.2000], dtype=torch.float64),
    pos_format={'normalized': True},
    path='scenario_0_0.json'
)
If the requested sample doesn’t exist in the dataset a MissingSNPSample exception is raised:
>>> (0, 1) in source
False
>>> source[0, 1]
Traceback (most recent call last):
...
dnadna.datasets.MissingSNPSample: could not load scenario 0 replicate 1 from "scenario_0_0.json": KeyError((0, 1))
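The JSON-compatible layout shown in the example output above (one string of 0/1 characters per SNP matrix row, plus a list of positions) can be reproduced with a few lines of plain Python. This is only a sketch of the layout as it appears in that output; the helper names sample_to_dict and sample_from_dict are hypothetical and are not part of DNADNA.

```python
def sample_to_dict(snp, pos):
    """Encode a 0/1 matrix as row strings, alongside the position list."""
    return {
        'SNP': [''.join(str(int(allele)) for allele in row) for row in snp],
        'POS': list(pos),
    }


def sample_from_dict(data):
    """Decode the row strings back into a list-of-lists matrix."""
    snp = [[int(char) for char in row] for row in data['SNP']]
    return snp, list(data['POS'])


encoded = sample_to_dict([[0, 1], [1, 0]], [0.1, 0.2])
decoded = sample_from_dict(encoded)
```

The row-string encoding is readable and diff-friendly, which suits test fixtures, at the cost of the storage and parsing overhead mentioned in the module overview.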
- name = 'dict'
The user-facing name of the plugin, which can be provided by a user implementing a plugin.
Typically it is automatically the same as the internal Pluggable._name but users are free to provide their own custom name here when implementing a plugin.
- plugin_url = 'py-obj:dnadna.schemas.plugins.snp_source.dict'
Base URL for all DNADNA plugins.
New plugins’ schemas can be found relative to this URL unless this attribute is explicitly overridden by the class implementing the plugin.
- class dnadna.datasets.FileListSNPSource(filenames)[source]
Bases: object
SNP source that returns scenarios from a fixed list of arbitrary files.
Because the concepts of “scenarios” and “replicates” are not necessarily applicable to an arbitrary list of files, each file is considered a single scenario of one replicate (e.g. source[3, 0] returns the contents of the fourth file in the list).
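The "one file is one scenario with a single replicate" convention can be sketched as follows. TinyFileListSource is a hypothetical class that reads plain text files; it only illustrates the indexing convention and makes no claim about how FileListSNPSource itself loads samples.

```python
import tempfile
from pathlib import Path


class TinyFileListSource:
    """Hypothetical sketch: file number i is scenario i, replicate 0."""

    def __init__(self, filenames):
        self.filenames = [Path(filename) for filename in filenames]

    def __contains__(self, key):
        scenario, replicate = key
        return replicate == 0 and 0 <= scenario < len(self.filenames)

    def __getitem__(self, key):
        if key not in self:
            raise KeyError(key)
        scenario, _ = key
        return self.filenames[scenario].read_text()


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(5):
        path = Path(tmp) / f'sample_{i}.txt'
        path.write_text(f'contents of file {i}')
        paths.append(path)

    source = TinyFileListSource(paths)
    fourth = source[3, 0]                      # fourth file in the list
    has_second_replicate = (3, 1) in source    # only replicate 0 exists
```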
- exception dnadna.datasets.MissingSNPSample(scenario, replicate, path, reason=None)[source]
Bases: Exception
Exception raised when a specified sample is not found in an SNP source.
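An exception with the same constructor arguments (scenario, replicate, path, reason) can be sketched as below. MissingSample and load are hypothetical names, and the message layout is copied from the traceback shown in the DictSNPSource example rather than from the library source, so this should be read as an approximation.

```python
class MissingSample(Exception):
    """Hypothetical look-alike of the exception described above."""

    def __init__(self, scenario, replicate, path, reason=None):
        self.scenario = scenario
        self.replicate = replicate
        self.path = path
        self.reason = reason
        message = (f'could not load scenario {scenario} '
                   f'replicate {replicate} from "{path}"')
        if reason is not None:
            # Show the underlying error, e.g. KeyError((0, 1))
            message += f': {reason!r}'
        super().__init__(message)


def load(samples, scenario, replicate, path):
    """Look up a sample, converting a KeyError into MissingSample."""
    try:
        return samples[scenario, replicate]
    except KeyError as exc:
        raise MissingSample(scenario, replicate, path, reason=exc) from exc


try:
    load({}, 0, 1, 'scenario_0_0.json')
except MissingSample as exc:
    error_message = str(exc)
```

Wrapping the low-level error this way lets callers catch a single exception type regardless of whether the underlying cause was a missing dict key or a missing file.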
- class dnadna.datasets.NpzSNPSource(root_dir, dataset_name, filename_format=None, keys=('SNP', 'POS'), position_format=None, lazy=True)[source]
Bases: SNPSource
SNP source that reads simulation data as SNPSamples stored on disk in DNADNA’s native “dnadna” format.
Each simulation is stored in a NumPy NPZ file containing two arrays, by default keyed by 'SNP' for the SNP matrix, and 'POS' for the positions array.
There is one .npz file for each replicate of each scenario, laid out in a filesystem format. The exact layout and filename can be specified by the filename_format argument to this class’s constructor, but the default layout is as specified in NpzSNPSource.DEFAULT_NPZ_FILENAME_FORMAT, which is also the documented format assumed by the “dnadna” format.
- Parameters:
root_dir (str, pathlib.Path) – The root directory of the DNADNA dataset. All filenames generated from the filename_format are appended to this directory.
dataset_name (str) – The name of the dataset, the same as that specified in the simulation config for this dataset.
- Keyword Arguments:
filename_format (str, optional) – A string in Python format string syntax specifying the format for filenames of individual simulations in this dataset. The format string can contain 3 replacement fields: {dataset_name}, which is filled in with the model name given by the dataset_name parameter above; {scenario}, which is filled with the scenario index; and {replicate}, which is filled with the replicate index. If the scenario and replicate indices are zero-padded in the filenames, the amount of zero-padding may be explicitly specified by writing the format string like {scenario:05} (for scenario indices zero-padded to a width of 5 digits). However, if no zero-padding is specified in the format string, the appropriate amount of zero-padding is automatically guessed from the filenames actually present in the dataset. Therefore the default filename_format, NpzSNPSource.DEFAULT_NPZ_FILENAME_FORMAT, can be used regardless of the amount of zero-padding used in a given dataset.
keys (tuple, optional) – A 2-tuple of (snp_key, pos_key) giving the keywords for the SNP matrix and the position array in the NPZ file. The default ('SNP', 'POS') is the default for the “dnadna” format, but different names may be specified for these arrays.
position_format (dict, optional) – The format of the position arrays in the dataset (currently all samples in the dataset are assumed to have the same position formats). Corresponds to the pos_format argument to SNPSample.
lazy (bool, optional) – By default data is lazy-loaded, so that it is not read from disk until needed. Use lazy=False to ensure that the data is immediately loaded into memory.
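One way the zero-padding of an unpadded filename_format could be matched against files actually present is sketched below. This is an assumption about a possible approach, not the algorithm NpzSNPSource uses: the function find_padded and the idea of turning each integer field into a regular expression that tolerates leading zeros are both invented for illustration.

```python
import re


def find_padded(filename_format, names, **fields):
    """Return the first name matching the format with any zero-padding.

    Hypothetical helper: integer fields may carry leading zeros in the
    candidate names, while all other text must match literally.
    """
    pattern = ''
    # Split into literal text and simple '{field}' replacement fields.
    for part in re.split(r'(\{\w+\})', filename_format):
        if re.fullmatch(r'\{\w+\}', part):
            value = fields[part[1:-1]]
            if isinstance(value, int):
                pattern += '0*' + str(value)   # tolerate leading zeros
            else:
                pattern += re.escape(str(value))
        else:
            pattern += re.escape(part)
    regex = re.compile(pattern)
    for name in names:
        if regex.fullmatch(name):
            return name
    return None


default_format = 'scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz'
present = [
    'scenario_002/my_model_002_000.npz',
    'scenario_003/my_model_003_007.npz',
]

found = find_padded(default_format, present,
                    dataset_name='my_model', scenario=3, replicate=7)
missing = find_padded(default_format, present,
                      dataset_name='my_model', scenario=13, replicate=7)
```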
Examples
>>> import numpy as np
>>> from dnadna.datasets import NpzSNPSource
>>> from dnadna.snp_sample import SNPSample
>>> tmp = getfixture('tmp_path')  # pytest-specific
Make a few random SNP and position arrays:
>>> dataset = {}
>>> filename_format = 'my_model_{scenario:03}_{replicate:03}.npz'
>>> for scenario_idx, replicate_idx in zip(range(2), range(2)):
...     snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
...     pos = np.sort(np.random.random(10))
...     sample = SNPSample(snp, pos)
...     filename = tmp / filename_format.format(
...         scenario=scenario_idx, replicate=replicate_idx)
...     sample.to_npz(filename)
...     dataset[(scenario_idx, replicate_idx)] = sample
Instantiate the NpzSNPSource and load a couple samples:
>>> source = NpzSNPSource(tmp, 'my_model', filename_format=filename_format)
>>> source[0, 0]
SNPSample(
    snp=tensor([[...],
                ...
                [...]], dtype=torch.uint8),
    pos=tensor([...], dtype=torch.float64),
    pos_format={'normalized': True},
    path=...Path('...my_model_000_000.npz')
)
>>> source[0, 0] == dataset[0, 0]
True
>>> source[1, 1] == dataset[1, 1]
True
>>> source[2, 0]
Traceback (most recent call last):
...
dnadna.datasets.MissingSNPSample: could not load scenario 2 replicate 0 from "...my_model_002_000.npz": FileNotFoundError(2, 'No file matching or similar to')
- DEFAULT_NPZ_FILENAME_FORMAT = 'scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz'
Default format string for filenames relative to the root_dir of an NpzSNPSource.
This is the default filesystem layout for the DNADNA format. Each scenario has its own directory named scenario_<scenario_idx> where the scenario_idx is typically zero-padded the correct amount for the total number of scenarios in the dataset.
Each simulation file in a scenario has the filename <model-name>_<scenario_idx>_<replicate_idx>.npz where both scenario_idx and replicate_idx are again zero-padded an appropriate amount.
In a simulation config with the option {"data_source": {"format": "dnadna"}}, this default filename format can be overridden with the {"data_source": {"filename_format": "..."}} option.
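Because the filename format is an ordinary Python format string, its expansion can be checked with str.format alone. The sketch below uses the default format string quoted above; the padded variant, the dataset name my_model, and the example indices are assumptions chosen for illustration.

```python
DEFAULT_FORMAT = 'scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz'

# Without explicit padding the indices are written as-is.
unpadded = DEFAULT_FORMAT.format(dataset_name='my_model',
                                 scenario=3, replicate=7)

# Explicit zero-padding uses the standard format-spec syntax.
PADDED_FORMAT = ('scenario_{scenario:05}/'
                 '{dataset_name}_{scenario:05}_{replicate:03}.npz')
padded = PADDED_FORMAT.format(dataset_name='my_model',
                              scenario=3, replicate=7)

# The override would then be placed in the config structure shown above.
config_fragment = {
    'data_source': {'format': 'dnadna', 'filename_format': PADDED_FORMAT},
}

print(unpadded)  # scenario_3/my_model_3_7.npz
print(padded)    # scenario_00003/my_model_00003_007.npz
```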
- classmethod from_config(config, validate=True)[source]
Instantiate an NpzSNPSource from a simulation Config matching the simulation schema.
- name = 'dnadna'
The user-facing name of the plugin, which can be provided by a user implementing a plugin.
Typically it is automatically the same as the internal Pluggable._name but users are free to provide their own custom name here when implementing a plugin.
- plugin_url = 'py-obj:dnadna.schemas.plugins.snp_source.dnadna'
Base URL for all DNADNA plugins.
New plugins’ schemas can be found relative to this URL unless this attribute is explicitly overridden by the class implementing the plugin.
- class dnadna.datasets.SNPSource[source]
Bases: Pluggable
A “SNPSource” is a class for loading SNPSample objects from some data source.
Subclasses of this class represent different data formats from which samples can be loaded.
This is in a way “lower-level” than DNADataset. DNADataset is an abstraction that loads SNPSamples from a data source, possibly performs some transforms on them, and returns them. From the point of view of DNADataset the actual on-disk format from which the samples are read is abstracted out to SNPSource.
In fact it may not even be an “on-disk” format; for example one could implement a SNPSource plugin that loads samples from an S3 bucket.
The “main” implementation of SNPSource is NpzSNPSource which loads samples organized on disk in the “dnadna” format. The other built-in implementations include:
- FileListSNPSource – a simple format that reads a list of SNPSamples from a list of filenames; this is used primarily by the dnadna predict command for reading in a list of files on which to make predictions.
- DictSNPSource – used primarily for testing, it can read samples from a JSON-compatible dict format; see its documentation for more details.
- classmethod from_config(config, validate=True)[source]
Instantiate an SNPSource from a dataset Config matching the dataset schema.
Although configuration specific to a given SNPSource subclass may have its own format-specific schema, these are still passed the full dataset config, which may contain additional properties (such as data_root) that might be useful to a given format.
Subclasses should implement this method in order to specify how to instantiate them from a config file; otherwise they cannot be used as configurable plugins.
- classmethod from_config_file(filename, validate=True, **kwargs)[source]
Like from_config but given a filename instead of a Config object.
The additional keyword arguments are passed to the dict serializer, and the config is validated against the dataset schema.
- name = 'snp_source'
The user-facing name of the plugin, which can be provided by a user implementing a plugin.
Typically it is automatically the same as the internal Pluggable._name but users are free to provide their own custom name here when implementing a plugin.
- plugin_url = 'py-obj:dnadna.schemas.plugins.snp_source'
Base URL for all DNADNA plugins.
New plugins’ schemas can be found relative to this URL unless this attribute is explicitly overridden by the class implementing the plugin.
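The from_config pattern described for SNPSource subclasses, where a format receives the full dataset config and picks out what it needs, might look roughly like the following. This is a stand-alone sketch that does not subclass SNPSource: the TinySource class, its path_for method, and the dataset_name config key are hypothetical, while data_root and data_source.filename_format are the property names mentioned in the text above.

```python
from pathlib import PurePosixPath


class TinySource:
    """Hypothetical source configured from a full dataset config dict."""

    name = 'tiny'  # assumed user-facing plugin name
    DEFAULT_FORMAT = ('scenario_{scenario}/'
                      '{dataset_name}_{scenario}_{replicate}.npz')

    def __init__(self, root_dir, dataset_name, filename_format=None):
        self.root_dir = PurePosixPath(root_dir)
        self.dataset_name = dataset_name
        self.filename_format = filename_format or self.DEFAULT_FORMAT

    @classmethod
    def from_config(cls, config):
        # The full dataset config is passed in; format-specific options
        # are assumed to live under 'data_source'.
        data_source = config.get('data_source', {})
        return cls(config.get('data_root', '.'),
                   config['dataset_name'],
                   filename_format=data_source.get('filename_format'))

    def path_for(self, scenario, replicate):
        """Build the path of one replicate relative to the data root."""
        filename = self.filename_format.format(
            dataset_name=self.dataset_name,
            scenario=scenario, replicate=replicate)
        return self.root_dir / filename


config = {
    'data_root': '/data/sims',
    'dataset_name': 'my_model',
    'data_source': {
        'format': 'dnadna',
        'filename_format': 'my_model_{scenario:03}_{replicate:03}.npz',
    },
}

source = TinySource.from_config(config)
sample_path = str(source.path_for(2, 0))
```

Passing the whole config rather than only the format-specific section is what allows a source to use shared properties such as data_root.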